Clustering Document with Active Learning using Wikipedia
ثبت نشده
چکیده
Wikipedia has been applied as a background knowledge base to various text mining problems, including document categorization, topic indexing and information extraction. However, very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit Wikipedia and the semantic knowledge therein to facilitate clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated to a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instancelevel constraints for supervising clustering, guiding clustering towards the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields comparable performance to previous attempts. Adding constraints improves clustering performance further by up to 20%.
منابع مشابه
Frequent Itemset Based Hierarchical Document Clustering Using Wikipedia as External Knowledge
High dimensionality is a major challenge in document clustering. Some of the recent algorithms address this problem by using frequent itemsets for clustering. But, most of these algorithms neglect the semantic relationship between the words. On the other hand there are algorithms that take care of the semantic relations between the words by making use of external knowledge contained in WordNet,...
متن کاملHigh-Dimensional Unsupervised Active Learning Method
In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...
متن کاملUnsupervised Document Classification with Informed Topic Models
Document classification is an important and common application in natural language processing. Scaling classification approaches to many targets faces a bottleneck in acquiring gold standard labels. In this work, we develop and evaluate a method for using informed topic models to noisily label documents, creating a noisy but usable set of labels for training discriminative classifiers. We inves...
متن کاملMultilingual Document Clustering Using Wikipedia as External Knowledge
This paper presents Multilingual Document Clustering (MDC) on comparable corpora. Wikipedia, a structured multilingual knowledge base, has been highly exploited in many monolingual clustering approaches and also in comparing multilingual corpora. But there is no prior work which studied the impact of Wikipedia on MDC. Here, we have made an in-depth study on availing Wikipedia in enhancing MDC p...
متن کاملTagging Scientific Publications Using Wikipedia and Natural Language Processing Tools - Comparison on the ArXiv Dataset
In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. As a first source of labels Wikipedia is employed, second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on the dataset consisting of abstracts from 0.7 million of sci...
متن کامل